Abstract



We compute the expected value of the Kullback-Leibler divergence to 
various fundamental statistical models with respect to canonical priors 
on the probability simplex. This yields information about the scaling of 
model approximation errors depending on the cardinality of the sample 
spaces, and it is a useful reference for more complicated statistical models 
such as restricted Boltzmann machines. 



1 Introduction 

Let $p$, $q$ be probability distributions on a finite set $\mathcal{X}$. The information divergence, also called relative entropy or Kullback-Leibler divergence,

\[ D(p\|q) = \sum_{i \in \mathcal{X}} p_i \log \frac{p_i}{q_i} \]

is a natural measure of dissimilarity between probability distributions that describes how easy it is to distinguish two distributions $p$ and $q$ by means of statistical experiments. In this paper we use the natural logarithm. The divergence is related to the log-likelihood: If $p$ is an empirical distribution, summarizing the outcome of $n$ statistical experiments, then the log-likelihood of a distribution $q$ equals $-n(D(p\|q) + H(p))$. Hence, finding a maximum likelihood estimator $q$ within some set of probability distributions $\mathcal{M}$ is the same as finding a minimizer of the divergence $D(p\|q)$ with $q$ restricted to $\mathcal{M}$. The value of $D(p\|q)$ quantifies how well, or how badly, the data can be described by $q$ (and by $\mathcal{M}$).

Assume that $\mathcal{M}_{\mathrm{true}}$ is a set of probability distributions for which we do not have a simple mathematical description. We are interested in finding a model $\mathcal{M}$ which does not necessarily include all distributions from $\mathcal{M}_{\mathrm{true}}$, but which approximates them relatively well. What error magnitude should we accept from a good model?
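The relation between the log-likelihood and the divergence stated above can be checked numerically. The following is a minimal sketch, not part of the original analysis; the sample data, the candidate distribution `q`, and the helper names `kl_divergence` and `entropy` are hypothetical choices made only for illustration.

```python
import math
from collections import Counter

def kl_divergence(p, q):
    """D(p||q) with natural logarithm; terms with p_i = 0 contribute 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # p is not absolutely continuous w.r.t. q
            total += pi * math.log(pi / qi)
    return total

def entropy(p):
    """Shannon entropy H(p) with natural logarithm."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical data: n = 10 outcomes on the state space {0, 1, 2}.
samples = [0, 0, 1, 2, 2, 2, 0, 1, 2, 2]
n = len(samples)
counts = Counter(samples)
p_emp = [counts[x] / n for x in range(3)]   # empirical distribution
q = [0.2, 0.3, 0.5]                         # hypothetical candidate distribution

# Log-likelihood of q, computed directly from the data ...
log_likelihood = sum(math.log(q[x]) for x in samples)
# ... and via the identity -n (D(p||q) + H(p)).
identity_rhs = -n * (kl_divergence(p_emp, q) + entropy(p_emp))
```

Since $H(p)$ does not depend on $q$, maximizing `log_likelihood` over $q$ is the same as minimizing `kl_divergence(p_emp, q)`.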

To assess the expressive power of a model $\mathcal{M}$, we study properties of the function $p \mapsto D(p\|\mathcal{M}) = \inf_{q \in \mathcal{M}} D(p\|q)$. For example, the problem of finding the maximizers of this function corresponds to a worst case analysis. The problem of maximizing the divergence from a statistical model was first posed, with different motivation, in [1]. Since then, a lot of progress has been made, notably in the case where $\mathcal{M}$ is an exponential family [5], [4], [8], but also for discrete mixture models and restricted Boltzmann machines [6].

This worst case bound is not the only aspect that decides whether a given model is suitable; the expected performance and expected error are also of interest. This leads to the mathematical problem of computing the expectation value

\[ \langle D(p\|\mathcal{M}) \rangle = \int_{\Delta} D(p\|\mathcal{M})\, \psi(p)\, dp, \]

where $p$ is drawn from a probability density $\psi$ on the probability simplex $\Delta$, called the prior distribution, or prior for short. The correct prior depends on the concrete problem at hand and is often difficult to determine. Given certain conditions on the prior, we also ask: how different is the worst case from the average case, and how much can this behavior be influenced by the choice of the model? We focus on the case that the prior $\psi$ is the uniform distribution or a Dirichlet distribution. It turns out that in most cases the worst-case error is unbounded (as the number of elementary events grows), while the expected error is bounded. Our analysis leads to integrals that have been considered in a Bayesian framework for function estimation in [10], and we can take advantage of the tools developed there.

Our first observation is that, if $\psi$ is the uniform prior, then the expected divergence from the uniform distribution is a monotone function of the system size $N$ (the number of elementary events) and converges to the constant $1 - \gamma \approx 0.4228 \approx 0.6099 \log(2)$ as $N \to \infty$, where $\gamma$ is the Euler-Mascheroni constant. Many natural statistical models contain the uniform distribution, and the expected divergence from such models is then bounded by the same constant. In comparison, for distributions $p$ and $q$ that are both chosen uniformly at random, the expected divergence $\langle D(p\|q) \rangle_{p,q}$ equals $1 - 1/N$. We show, for a class of models including the independence models, partition models, mixtures of product distributions with disjoint supports [6], and decomposable hierarchical models, that the expected divergence actually has the same limit $1 - \gamma$, provided that the models remain small with respect to $N$ (this is the case in most applications). In contrast, the maximum of the divergence from these models is at least $\log(N/(\dim \mathcal{M} + 1))$, see [9]. For reasonable choices of the parameters, the results for Dirichlet priors are similar.

In Section 2 we define the models that we are interested in and collect basic properties of the Dirichlet priors. Section 3 contains analytical results for expectation values of entropies and divergences from these models. The results are interpreted in Section 4. Proofs and calculations are deferred to Appendix A.
2 Preliminaries

2.1 Models from statistics and machine learning 

We consider random variables on a finite set of elementary events $\mathcal{X}$, $|\mathcal{X}| = N$. The set of probability distributions on $\mathcal{X}$ is the $(N-1)$-simplex $\Delta_{N-1} \subset \mathbb{R}^N$. We call any subset $\mathcal{M} \subseteq \Delta_{N-1}$ that can be densely parametrized a model. The support sets of a model $\mathcal{M}$ are the support sets $\operatorname{supp}(p) = \{i \in \mathcal{X} \mid p_i > 0\}$ of points $p = (p_i)_{i \in \mathcal{X}}$ in $\mathcal{M}$.

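The $k$-mixture of a model $\mathcal{M}$ is the union of all convex combinations of any $k$ of its points, $\mathcal{M}^k := \{\sum_{i=1}^k \lambda_i p^{(i)} \mid \lambda_i \ge 0, \sum_i \lambda_i = 1, p^{(i)} \in \mathcal{M}\}$. The $k$-mixture with disjoint supports is the subset of $\mathcal{M}^k$ defined by

\[ \Big\{ \sum_{i=1}^k \lambda_i p^{(i)} \in \mathcal{M}^k \;\Big|\; \operatorname{supp}(p^{(i)}) \cap \operatorname{supp}(p^{(j)}) = \emptyset \text{ for all } i \neq j \Big\}. \]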

Let $\varrho = \{A_1, \dots, A_K\}$ be a partition of $\mathcal{X}$. The partition model $\mathcal{M}_\varrho$ consists of all $p \in \Delta_{N-1}$ that satisfy $p_i = p_j$ whenever $i, j$ belong to the same block of $\varrho$. Partition models are closures of convex exponential families with uniform reference measure. The closure of an arbitrary convex exponential family is of the form (see [4])

\[ \mathcal{M}_{\varrho,\nu} = \Big\{ \sum_{k=1}^K \lambda_k \frac{1_{A_k} \nu}{\nu(A_k)} \;\Big|\; \lambda_k \ge 0, \; \sum_k \lambda_k = 1 \Big\}, \]

where $\nu : \mathcal{X} \to (0, \infty)$ is a positive function on $\mathcal{X}$ called reference measure, and $1_A$ is the indicator function of $A$. Note that all measures $\nu$ with equal conditional distributions $\nu(\cdot|A_k)$ yield the same model. In fact, $\mathcal{M}_{\varrho,\nu}$ equals the $K$-mixture of the set $\{\nu(\cdot|A_k) : k = 1, \dots, K\}$.

A composite system of $n$ variables has the state space $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$, with $|\mathcal{X}_i| = N_i$ for all $i$. A product distribution is a distribution of the form

\[ p(x_1, \dots, x_n) = p_1(x_1) \cdots p_n(x_n), \]

where $p_i \in \Delta_{N_i - 1}$. The independence model is the set of all product distributions on a composite system. The support sets of the independence model are the sets of the form $A = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_n$ with $\mathcal{Y}_i \subseteq \mathcal{X}_i$ for each $i$.

Let $\mathcal{S}$ be a simplicial complex on $\{1, \dots, n\}$. The hierarchical model $\mathcal{M}_\mathcal{S}$ consists of all probability distributions that have a factorization of the form $p(x) = \prod_{S \in \mathcal{S}} \Phi_S(x)$, where $\Phi_S$ is a positive function that depends only on the $S$-components of $x$. The model $\mathcal{M}_\mathcal{S}$ is called reducible if there exist simplicial subcomplexes $\mathcal{S}_1, \mathcal{S}_2 \subset \mathcal{S}$ such that $\mathcal{S}_1 \cup \mathcal{S}_2 = \mathcal{S}$ and $\mathcal{S}_1 \cap \mathcal{S}_2$ is a simplex. In this case, the set $(\bigcup_{Y \in \mathcal{S}_1} Y) \cap (\bigcup_{Y \in \mathcal{S}_2} Y)$ is called a separator. $\mathcal{M}_\mathcal{S}$ is decomposable if it can be iteratively reduced into simplices. The reduction can be described by a junction tree (see [2]), which is a tree $(V,E)$ with vertex set the set of facets of $\mathcal{S}$ and such that the following holds: If $(X,Y)$ is an edge, then $X \cap Y$ is a separator, and if this edge is removed from the tree, then the two resulting trees are junction trees of two subcomplexes $\mathcal{S}_1$ and $\mathcal{S}_2$ separated by $X \cap Y$. In general the junction tree is not unique, but the multi-set of separators is unique. The independence model is an example of a decomposable model.

For most models it is not possible to find a closed formula for $D(\cdot\|\mathcal{M})$, since there is no closed formula for $\operatorname{arg\,inf}_{q \in \mathcal{M}} D(p\|q)$. However, for some of the above mentioned models a closed formula does exist:

The divergence from the independence model $\mathcal{M}_1$ is called multi-information and satisfies

\[ MI(X_1, \dots, X_n) = D(p\|\mathcal{M}_1) = -H(X_1, \dots, X_n) + \sum_{k=1}^n H(X_k). \qquad (1) \]

If $n = 2$ it is also called the mutual information of $X_1$ and $X_2$. The divergence from $\mathcal{M}_{\varrho,\nu}$ equals (see [4, eq. (1)])

\[ D(p\|\mathcal{M}_{\varrho,\nu}) = D\Big(p \,\Big\|\, \sum_{k=1}^K p(A_k)\, \nu(\cdot|A_k)\Big). \qquad (2) \]

For a decomposable model $\mathcal{M}_\mathcal{S}$ with junction tree $(V,E)$,

\[ D(p\|\mathcal{M}_\mathcal{S}) = \sum_{S \in V} H_p(X_S) - \sum_{S \in E} H_p(X_S) - H(p). \qquad (3) \]

Here, $H_p(X_S)$ denotes the joint entropy of the random variables $\{X_i\}_{i \in S}$ under $p$.
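Formula (1) can be illustrated on a small example. The sketch below assumes a hypothetical joint distribution of two binary variables (the numbers are made up for illustration) and compares the entropy expression in (1) with the divergence from the product of the marginals, which is the closest point of the independence model.

```python
import math

def entropy(dist):
    """Shannon entropy of a distribution given as {state: probability}."""
    return -sum(v * math.log(v) for v in dist.values() if v > 0)

def marginal(joint, k):
    """Marginal distribution of the k-th variable of a joint distribution."""
    out = {}
    for x, v in joint.items():
        out[x[k]] = out.get(x[k], 0.0) + v
    return out

def kl(p, q):
    """D(p||q) for distributions stored as dictionaries over the same states."""
    return sum(v * math.log(v / q[x]) for x, v in p.items() if v > 0)

# Hypothetical joint distribution of two correlated binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
n = 2
marginals = [marginal(joint, k) for k in range(n)]

# Right-hand side of (1): sum of marginal entropies minus joint entropy.
multi_information = sum(entropy(m) for m in marginals) - entropy(joint)

# Divergence from the product of the marginals.
product_of_marginals = {
    x: math.prod(marginals[k][x[k]] for k in range(n)) for x in joint
}
direct = kl(joint, product_of_marginals)

# Some other product distribution; it cannot be closer to the joint.
other_product = {x: (0.6 if x[0] == 0 else 0.4) * 0.5 for x in joint}
```

Here `multi_information` and `direct` agree up to rounding, and `kl(joint, other_product)` is larger, as expected for a non-optimal element of the independence model.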

2.2 Dirichlet prior 

The Dirichlet distribution (or Dirichlet prior) with concentration parameter $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_N)$, $\alpha_i > 0$ for all $i$, is the probability distribution on $\Delta_{N-1}$ defined by

\[ \mathrm{Dir}_{\boldsymbol{\alpha}}(p) := \frac{\Gamma(\sum_{i=1}^N \alpha_i)}{\prod_{i=1}^N \Gamma(\alpha_i)} \prod_{i=1}^N p_i^{\alpha_i - 1} \quad \text{for } p = (p_1, \dots, p_N) \in \Delta_{N-1}, \]

where $\Gamma$ is the gamma function. We write $\alpha = \sum_{i=1}^N \alpha_i$.

We will highlight especially the symmetric case $(\alpha_1, \dots, \alpha_N) = (a, \dots, a)$, which assigns no preferences to the elementary events. Observe that $\mathrm{Dir}_{(1,\dots,1)}$ is the uniform probability density on $\Delta_{N-1}$. Furthermore, it is known that $\lim_{a \to 0} \mathrm{Dir}_{(a,\dots,a)}$ is uniformly concentrated in the point measures (it assigns mass $1/N$ to $p = \delta_x$, $x \in \mathcal{X}$), while $\lim_{a \to \infty} \mathrm{Dir}_{(a,\dots,a)}$ is concentrated in the uniform distribution $u := (1/N, \dots, 1/N)$. In general, if $\boldsymbol{\alpha} \in \Delta_{N-1}$, then $\lim_{\kappa \to \infty} \mathrm{Dir}_{\kappa \boldsymbol{\alpha}}$ is the Dirac delta concentrated on $\boldsymbol{\alpha}$.

The Dirichlet distributions satisfy the following aggregation property: Consider a partition $\varrho = \{A_1, \dots, A_K\}$ of $\mathcal{X} = \{1, \dots, N\}$. If $p = (p_1, \dots, p_N) \sim \mathrm{Dir}_{(\alpha_1, \dots, \alpha_N)}$, then $(\sum_{i \in A_1} p_i, \dots, \sum_{i \in A_K} p_i) \sim \mathrm{Dir}_{(\sum_{i \in A_1} \alpha_i, \dots, \sum_{i \in A_K} \alpha_i)}$, see, e.g., [3]. We write $\boldsymbol{\alpha}^\varrho = (\alpha^\varrho_1, \dots, \alpha^\varrho_K)$, $\alpha^\varrho_k = \sum_{i \in A_k} \alpha_i$, for the concentration parameter induced by the partition $\varrho$. The aggregation property is useful when treating marginals of composite systems. Given a composite system with $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$, $|\mathcal{X}| = N$, $\mathcal{X}_k = \{1, \dots, N_k\}$, we write $\boldsymbol{\alpha}^k = (\alpha^k_1, \dots, \alpha^k_{N_k})$, $\alpha^k_j = \sum_{x \in \mathcal{X}: x_k = j} \alpha_x$, for the concentration parameter of the Dirichlet distribution induced on the $X_k$-marginal $(\sum_{x \in \mathcal{X}: x_k = 1} p(x), \dots, \sum_{x \in \mathcal{X}: x_k = N_k} p(x))$. Note that $\sum_{j=1}^{N_k} \alpha^k_j = \alpha$, and moreover, if $\alpha_x = 1$ for all $x \in \mathcal{X}$, then $\alpha^k_j = N/N_k$ for $j = 1, \dots, N_k$. For example, if $p$ is drawn uniformly from the simplex of joint distributions $\Delta_{N-1}$, then the sampled marginal probability distribution $p(y_k) = \sum_{x \in \mathcal{X}: x_k = y_k} p(x)$, $y_k \in \mathcal{X}_k$, is Dirichlet distributed in $\Delta_{N_k - 1}$ with concentration parameter $\boldsymbol{\alpha}^k = (N/N_k, \dots, N/N_k)$.
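The induced concentration parameters can be computed by direct enumeration, and the aggregation property can be probed by sampling. The following sketch assumes a hypothetical system with $N_1 = 2$ and $N_2 = 3$ under the uniform prior; the sampler normalizes independent gamma variables, which is a standard way to draw from a Dirichlet distribution, and the function name `marginal_concentration` is illustrative.

```python
import random
from itertools import product

def marginal_concentration(alpha, sizes, k):
    """alpha^k_j = sum of alpha_x over all joint states x with x_k = j."""
    out = [0.0] * sizes[k]
    for x in product(*(range(s) for s in sizes)):
        out[x[k]] += alpha[x]
    return out

sizes = (2, 3)                                    # N_1 = 2, N_2 = 3, so N = 6
states = list(product(*(range(s) for s in sizes)))
alpha = {x: 1.0 for x in states}                  # uniform prior on the joint simplex

alpha_1 = marginal_concentration(alpha, sizes, 0)   # N/N_1 in every entry
alpha_2 = marginal_concentration(alpha, sizes, 1)   # N/N_2 in every entry

# Monte Carlo check for the first coordinate of the X_1-marginal.
rng = random.Random(0)
trials = 20000
first = []
for _ in range(trials):
    g = {x: rng.gammavariate(alpha[x], 1.0) for x in states}
    total = sum(g.values())
    first.append(sum(v for x, v in g.items() if x[0] == 0) / total)

mean = sum(first) / trials
var = sum((v - mean) ** 2 for v in first) / trials
# A Dir(3, 3) marginal predicts mean 1/2 and variance 3*3 / (6**2 * 7) = 1/28.
```

The empirical mean and variance only test two moments of the marginal, so this is a consistency check rather than a proof of the aggregation property.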

3 Expected entropies and divergences 

For any $k \in \mathbb{N}$ let $h(k) = 1 + \frac{1}{2} + \cdots + \frac{1}{k}$ be the $k$th harmonic number. It is known that for large $k$,

\[ h(k) = \log(k) + \gamma + O(\tfrac{1}{k}), \]

where $\gamma \approx 0.57721$ is the Euler-Mascheroni constant. Moreover, $h(k) - \log(k)$ is strictly positive and decreases monotonically. We also need the natural analytic extension of $h$ to the non-negative reals given by $h(z) = \partial_z \log(\Gamma(z + 1)) + \gamma$, where $\Gamma$ is the gamma function.
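For numerical experiments, $h(z)$ can be evaluated through the digamma function $\psi(z) = \partial_z \log \Gamma(z)$, since $h(z) = \psi(z + 1) + \gamma$. The sketch below uses only the standard library and approximates $\psi$ with the usual recurrence and asymptotic series; the cutoff at 6 and the number of series terms are ad hoc choices that give roughly eight correct digits, not a tuned implementation.

```python
import math

EULER_GAMMA = 0.5772156649015329

def digamma(x):
    """Digamma function for x > 0: shift x upward with the recurrence
    psi(x) = psi(x + 1) - 1/x, then apply the asymptotic series
    log(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)."""
    if x <= 0:
        raise ValueError("this sketch only handles x > 0")
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return acc + series

def h(z):
    """Analytic extension of the harmonic numbers: h(z) = psi(z + 1) + gamma."""
    return digamma(z + 1.0) + EULER_GAMMA

def harmonic(k):
    """Plain harmonic number 1 + 1/2 + ... + 1/k for an integer k >= 0."""
    return sum(1.0 / j for j in range(1, k + 1))

# h(k) - log(k) for growing k: positive, decreasing, approaching gamma.
gaps = [h(k) - math.log(k) for k in (1, 10, 100, 1000)]
```

On integers `h` agrees with `harmonic`, and `h(0.0)` is zero up to the approximation error, which matches the convention $h(0) = 0$ used in the limits $a \to 0$ below.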

The following theorems present formulas for expectation values of divergences from models as well as asymptotic results. The results are based on explicit solutions of the integrals, as done by [10]. The proofs are contained in Appendix A.

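Theorem 1. If $p \sim \mathrm{Dir}_{\boldsymbol{\alpha}}$, then:

• $\langle H(p) \rangle = h(\alpha) - \sum_{i=1}^N \frac{\alpha_i}{\alpha} h(\alpha_i)$

• $\langle D(p\|u) \rangle = \log(N) - h(\alpha) + \sum_{i=1}^N \frac{\alpha_i}{\alpha} h(\alpha_i)$

In the symmetric case $(\alpha_1, \dots, \alpha_N) = (a, \dots, a)$,

• \[ \langle H(p) \rangle = h(Na) - h(a) = \begin{cases} \log(Na) + \gamma - h(a) + O(1/(Na)) & \text{for large } N \text{ and const. } a \\ \log(N) + O(1/a) & \text{for large } a \text{ and arb. } N \\ O(aN) & \text{as } a \to 0 \text{ with bounded } N \\ h(c) + O(a) & \text{as } a \to 0 \text{ with } aN = c \end{cases} \]

• \[ \langle D(p\|u) \rangle = \log(N) - h(aN) + h(a) = \begin{cases} h(a) - \log(a) - \gamma + O(1/(Na)) & \text{for large } N \text{ and const. } a \\ O(1/a) & \text{for large } a \text{ and arb. } N \\ \log(N) + O(aN) & \text{as } a \to 0 \text{ with bounded } N \\ \log(N) - h(c) + O(a) & \text{as } a \to 0 \text{ with } aN = c \end{cases} \]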
The maximum of the (Shannon) entropy $H(p) = -\sum_i p_i \log p_i$ on the probability simplex $\Delta_{N-1}$ is attained at the uniform distribution $u$, which satisfies $H(u) = \log(N)$. For large $N$ or $a$, the average entropy is close to the maximum value. It follows that in these cases the expected divergence from the uniform distribution $u$ remains bounded. The fact that the expected entropy is close to the maximal entropy makes it difficult to estimate the entropy. See [7] for a discussion and possible solutions.
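The symmetric-case formula of Theorem 1 can be compared with a direct Monte Carlo estimate. The sketch below is only a plausibility check under stated assumptions: it restricts to integer $a$ so that $h$ reduces to ordinary harmonic numbers, it samples the Dirichlet distribution by normalizing gamma variables, and the values of `N`, `a`, the seed and the number of trials are arbitrary.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def harmonic(k):
    """Harmonic number h(k) for an integer k >= 0."""
    return sum(1.0 / j for j in range(1, k + 1))

def sample_symmetric_dirichlet(N, a, rng):
    """Draw p ~ Dir(a, ..., a) by normalizing independent Gamma(a, 1) variables."""
    g = [rng.gammavariate(a, 1.0) for _ in range(N)]
    s = sum(g)
    return [x / s for x in g]

def divergence_from_uniform(p):
    """D(p||u) = log(N) - H(p), natural logarithm."""
    return math.log(len(p)) + sum(x * math.log(x) for x in p if x > 0)

def expected_divergence_formula(N, a):
    """Theorem 1, symmetric case, integer a: log(N) - h(aN) + h(a)."""
    return math.log(N) - harmonic(a * N) + harmonic(a)

rng = random.Random(0)
N, a, trials = 20, 2, 20000
estimate = sum(
    divergence_from_uniform(sample_symmetric_dirichlet(N, a, rng))
    for _ in range(trials)
) / trials
exact = expected_divergence_formula(N, a)

# For the uniform prior (a = 1) the formula approaches 1 - gamma as N grows.
uniform_large_n = expected_divergence_formula(10**4, 1)
```

With these settings `estimate` and `exact` should agree to about two decimal places, and `uniform_large_n` lies close to `1 - EULER_GAMMA`.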

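Theorem 2.

• For any $q \in \Delta_{N-1}$, when $p \sim \mathrm{Dir}_{\boldsymbol{\alpha}}$, then

\[ \langle D(p\|q) \rangle = \sum_{i=1}^N \frac{\alpha_i}{\alpha} \big(h(\alpha_i) - \log(q_i)\big) - h(\alpha). \]

If $\boldsymbol{\alpha} = (a, \dots, a)$, then this becomes

\[ \langle D(p\|q) \rangle = \log(N) - h(aN) + h(a) + D(u\|q). \]

• When $p \sim \mathrm{Dir}_{\boldsymbol{\alpha}}$ and $q \sim \mathrm{Dir}_{\tilde{\boldsymbol{\alpha}}}$, then

  – $\langle \sum_{i \in \mathcal{X}} p_i \log(q_i) \rangle = \sum_{i=1}^N \frac{\alpha_i}{\alpha} h(\tilde\alpha_i - 1) - h(\tilde\alpha - 1)$,

  – $\langle D(p\|q) \rangle = -\sum_{i=1}^N \frac{\alpha_i}{\alpha} \big(h(\tilde\alpha_i - 1) - h(\alpha_i)\big) + h(\tilde\alpha - 1) - h(\alpha)$.

  If $\boldsymbol{\alpha} = \tilde{\boldsymbol{\alpha}}$, then $\langle D(p\|q) \rangle = \frac{N-1}{\alpha}$.

• For any $q \in \Delta_{N-1}$, when $p$ is drawn uniformly from $\Delta_{N-1}$, then

\[ \langle D(p\|q) \rangle = -\sum_{i=1}^N \frac{1}{N} \log(q_i) - h(N) + 1 = D(u\|q) + 1 - \gamma + O(1/N). \]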

The divergence is unbounded in $\Delta_{N-1} \times \Delta_{N-1}$, since $D(p\|q) = +\infty$ if $p$ is not absolutely continuous with respect to $q$. Nevertheless, if $p, q \sim \mathrm{Dir}_{\boldsymbol{\alpha}}$, then in the limit $N \to \infty$ the expected divergence $\langle D(p\|q) \rangle$ remains bounded, provided $\alpha/N$ is bounded from below by a positive constant.

Consider a sequence of distributions $q_N \in \Delta_{N-1}$, $N \in \mathbb{N}$. As $N \to \infty$ the expected divergence $\langle D(\cdot\|q_N) \rangle$ with respect to the uniform prior is bounded from above by $1 - \gamma + \epsilon$, $\epsilon > 0$, if and only if $\limsup_{N \to \infty} D(u\|q_N) \le \epsilon$. If $q_N(x) \ge \frac{1}{N} e^{-\epsilon}$ for all $x \in \mathcal{X}$, then $D(u\|q_N) \le \epsilon$. Therefore, the expected divergence $\langle D(\cdot\|q_N) \rangle$ is unbounded only if the sequence $q_N$ accumulates at the boundary of the probability simplex, and $\lim_{N \to \infty} \langle D(p\|q_N) \rangle \le 1 - \gamma + \epsilon$ whenever $q_N$ is in the subsimplex $\operatorname{conv}\{(1 - e^{-\epsilon})\delta_x + e^{-\epsilon} u\}_{x \in \mathcal{X}}$. The relative Lebesgue volume of this subsimplex in $\Delta_{N-1}$ is $(1 - e^{-\epsilon})^{N-1}$.
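The second item of Theorem 2 can be evaluated exactly for integer concentration parameters and compared with simulation. The sketch below is illustrative only: the parameter vector `[1, 2, 3]`, the system size, the seed and the sample count are arbitrary, and the uniform prior is sampled by normalizing exponential variables.

```python
import math
import random

def harmonic(k):
    """Harmonic number h(k) for an integer k >= 0, with h(0) = 0."""
    return sum(1.0 / j for j in range(1, k + 1))

def expected_kl_both_dirichlet(alpha, alpha_tilde):
    """Second item of Theorem 2 for integer parameters:
    p ~ Dir(alpha) and q ~ Dir(alpha_tilde), drawn independently."""
    a, at = sum(alpha), sum(alpha_tilde)
    cross = sum(
        ai / a * (harmonic(ti - 1) - harmonic(ai))
        for ai, ti in zip(alpha, alpha_tilde)
    )
    return -cross + harmonic(at - 1) - harmonic(a)

def sample_uniform_simplex(N, rng):
    """Uniform sample from the simplex: normalized Exp(1) variables."""
    g = [rng.expovariate(1.0) for _ in range(N)]
    s = sum(g)
    return [x / s for x in g]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Equal parameters: the formula should collapse to (N - 1) / alpha.
closed_form = expected_kl_both_dirichlet([1, 2, 3], [1, 2, 3])

# Both p and q uniform on the simplex with N = 10: expected value 1 - 1/N.
rng = random.Random(1)
N, trials = 10, 20000
estimate = sum(
    kl(sample_uniform_simplex(N, rng), sample_uniform_simplex(N, rng))
    for _ in range(trials)
) / trials
```

For the parameter vector `[1, 2, 3]` one has $N = 3$ and $\alpha = 6$, so `closed_form` equals $2/6$; the Monte Carlo value `estimate` fluctuates around $0.9$.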

Theorem 3. Consider a composite system of $n$ random variables $X_1, \dots, X_n$ with joint probability distribution $p$. If $p \sim \mathrm{Dir}_{\boldsymbol{\alpha}}$ then

• $\langle H(X_k) \rangle = h(\alpha) - \sum_{j=1}^{N_k} \frac{\alpha^k_j}{\alpha} h(\alpha^k_j)$,

• $\langle MI(X_1, \dots, X_n) \rangle = (n-1) h(\alpha) + \sum_{i=1}^N \frac{\alpha_i}{\alpha} h(\alpha_i) - \sum_{k=1}^n \sum_{j=1}^{N_k} \frac{\alpha^k_j}{\alpha} h(\alpha^k_j)$.

If $(\alpha_1, \dots, \alpha_N) = (a, \dots, a)$ (symmetric Dirichlet),

• $\langle H(X_k) \rangle = h(Na) - h(\frac{N}{N_k} a)$,

• $\langle MI(X_1, \dots, X_n) \rangle = (n-1) h(Na) + h(a) - \sum_{k=1}^n h(\frac{N}{N_k} a)$.

If, moreover, $Na/N_k$ is large for all $k$ (this happens, for example, when $a$ remains bounded from below by some $\epsilon > 0$ and (i) all $N_k$ become large, or (ii) all $N_k$ are bounded and $n$ becomes large), then:

• $\langle H(X_k) \rangle = \log(N_k) + O(N_k/(Na))$,

• $\langle MI(X_1, \dots, X_n) \rangle = h(a) - \log(a) - \gamma + O(n \max_k N_k/(Na))$.

If $Na/N_k$ is large for all $k$, then the expected entropy of a subsystem is also close to its maximum, and hence the expected multi-information is bounded. This follows also from the fact that the independence model contains the uniform distribution, and hence $D(p\|\mathcal{M}_1) \le D(p\|u)$.
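The symmetric formula of Theorem 3 is easy to tabulate. The following sketch assumes the uniform prior ($a = 1$) and $n$ binary variables, so that all arguments of $h$ are integers; these choices and the function names are illustrative. It compares the expected multi-information with the expected divergence from the uniform distribution from Theorem 1 and with the limit $1 - \gamma$.

```python
import math

EULER_GAMMA = 0.5772156649015329

def harmonic(k):
    """Harmonic number h(k) for an integer k >= 0."""
    return sum(1.0 / j for j in range(1, k + 1))

def expected_multi_information(n):
    """Theorem 3, symmetric case with a = 1 and n binary variables
    (N = 2**n, N_k = 2): (n - 1) h(N) + h(1) - n h(N / 2)."""
    N = 2 ** n
    return (n - 1) * harmonic(N) + harmonic(1) - n * harmonic(N // 2)

def expected_divergence_from_uniform(n):
    """Theorem 1 with a = 1 and N = 2**n: log(N) - h(N) + h(1)."""
    N = 2 ** n
    return math.log(N) - harmonic(N) + harmonic(1)

# Rows: (n, expected multi-information, expected divergence from u).
table = [
    (n, expected_multi_information(n), expected_divergence_from_uniform(n))
    for n in (2, 4, 8, 12)
]
limit = 1.0 - EULER_GAMMA
```

In every row the expected multi-information stays below the expected divergence from $u$, and both columns approach `limit` as $n$ grows.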

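Theorem 4. Let $\varrho = \{A_1, \dots, A_K\}$ be a partition of $\mathcal{X}$ into sets of cardinalities $|A_k| = L_k$, and let $\nu$ be a reference measure on $\mathcal{X}$. If $p \sim \mathrm{Dir}_{\boldsymbol{\alpha}}$, then

\[ \langle D(p\|\mathcal{M}_{\varrho,\nu}) \rangle = \sum_{i=1}^N \frac{\alpha_i}{\alpha} \big(h(\alpha_i) - \log(\nu_i)\big) - \sum_{k=1}^K \frac{\alpha^\varrho_k}{\alpha} \big(h(\alpha^\varrho_k) - \log(\nu(A_k))\big), \]

where $\alpha^\varrho_k = \sum_{i \in A_k} \alpha_i$. If $\boldsymbol{\alpha} = (a, \dots, a)$, and (wlog) $\nu(A_k) = L_k/N$,

\[ \langle D(p\|\mathcal{M}_{\varrho,\nu}) \rangle = h(a) - \sum_{k=1}^K \frac{L_k}{N} \big(h(L_k a) - \log(L_k)\big) + D(u\|\nu). \]

If furthermore $N \gg K$, then

\[ \langle D(p\|\mathcal{M}_{\varrho,\nu}) \rangle = h(a) - \log(a) - \gamma + D(u\|\nu) + O(1/N). \]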
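Theorem 4 can be checked by simulation in the simplest setting. The sketch below assumes the uniform prior ($a = 1$) and the uniform reference measure ($\nu = u$, so $D(u\|\nu) = 0$), and uses a hypothetical partition of six states into blocks of sizes 3, 2 and 1. The divergence of each sample is computed with formula (2).

```python
import math
import random

def harmonic(k):
    """Harmonic number h(k) for an integer k >= 0."""
    return sum(1.0 / j for j in range(1, k + 1))

def divergence_to_partition_model(p, blocks):
    """Formula (2) with uniform reference measure: the minimizer spreads the
    mass p(A_k) of each block evenly over its L_k states."""
    total = 0.0
    for block in blocks:
        mass = sum(p[i] for i in block)
        for i in block:
            if p[i] > 0:
                total += p[i] * math.log(p[i] * len(block) / mass)
    return total

def expected_divergence_formula(blocks):
    """Theorem 4 for a = 1 and nu = u:
    h(1) - sum_k (L_k / N) (h(L_k) - log(L_k))."""
    N = sum(len(b) for b in blocks)
    return harmonic(1) - sum(
        len(b) / N * (harmonic(len(b)) - math.log(len(b))) for b in blocks
    )

blocks = [(0, 1, 2), (3, 4), (5,)]        # hypothetical partition, L = (3, 2, 1)
rng = random.Random(0)
trials = 20000
acc = 0.0
for _ in range(trials):
    g = [rng.expovariate(1.0) for _ in range(6)]   # uniform sample on the simplex
    s = sum(g)
    acc += divergence_to_partition_model([x / s for x in g], blocks)
estimate = acc / trials
exact = expected_divergence_formula(blocks)
```

The trivial partition with a single block reproduces $D(p\|u)$, which gives a quick sanity check of `divergence_to_partition_model`.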

Partition models (with $\nu = u$) also contain the uniform distribution, and therefore the expected divergence is again bounded. In contrast, the maximal divergence is $\max_{p \in \Delta_{N-1}} D(p\|\mathcal{M}_\varrho) = \max_k \log(L_k)$. The result for mixtures of product distributions of disjoint supports is similar:

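Theorem 5. Let $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ be the joint state space of $n$ variables, $|\mathcal{X}| = N$, $|\mathcal{X}_k| = N_k$. Let $\varrho = \{A_1, \dots, A_K\}$ be a partition of $\mathcal{X}$ into support sets of the independence model of cardinalities $|A_k| = L_k$, and let $\mathcal{M}^1_\varrho$ be the model containing all mixtures of $K$ product distributions $p^{(1)}, \dots, p^{(K)}$ with $\operatorname{supp}(p^{(k)}) \subseteq A_k$.

• If $p \sim \mathrm{Dir}_{(\alpha_1, \dots, \alpha_N)}$, then the expected divergence to $\mathcal{M}^1_\varrho$ is

\[ \langle D(p\|\mathcal{M}^1_\varrho) \rangle = \sum_{i=1}^N \frac{\alpha_i}{\alpha} \big(h(\alpha_i) - h(\alpha)\big) + \sum_{k=1}^K (|G_k| - 1) \frac{\alpha^\varrho_k}{\alpha} \big(h(\alpha^\varrho_k) - h(\alpha)\big) - \sum_{k=1}^K \sum_{j \in G_k} \sum_{x_j \in \mathcal{X}_{j,k}} \frac{\alpha^{k,x_j}}{\alpha} \big(h(\alpha^{k,x_j}) - h(\alpha)\big), \]

where $\alpha^\varrho_k = \sum_{x \in A_k} \alpha_x$, $\alpha^{k,x_j} = \sum_{y \in A_k : y_j = x_j} \alpha_y$, $\mathcal{X}_{j,k} \subseteq \mathcal{X}_j$ is the set of values that the $j$th variable takes in the block $A_k$, and $G_k \subseteq [n]$ is the set of variables that take more than one value in the block $A_k$.

• Assume that the system is homogeneous, $N_i = N_1$ for all $i$, and that, for each $k$, $A_k$ is a cylinder set of cardinality $|A_k| = N_1^{m_k}$, where $m_k = |G_k|$. If $(\alpha_1, \dots, \alpha_N) = (a, \dots, a)$, then

\[ \langle D(p\|\mathcal{M}^1_\varrho) \rangle = h(a) + \sum_{k=1}^K N_1^{m_k - n} \big((m_k - 1)\, h(N_1^{m_k} a) - m_k\, h(N_1^{m_k - 1} a)\big). \]

• If $\frac{N_1^{m_k - 1} a}{m_k}$ is large for all $k$, then

\[ \langle D(p\|\mathcal{M}^1_\varrho) \rangle = h(a) - \log(a) - \gamma + O\Big(\max_k \frac{m_k}{N_1^{m_k - 1} a}\Big). \]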

The $k$-mixture of binary product distributions with disjoint supports is contained in the restricted Boltzmann machine model with $k - 1$ hidden nodes, see [6]. Hence Theorem 5 gives bounds for the expected divergence to these models.

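Theorem 6. For a decomposable model $\mathcal{M}_\mathcal{S}$ with junction tree $(V,E)$, if $p \sim \mathrm{Dir}_{(\alpha_1, \dots, \alpha_N)}$, then

\[ \langle D(p\|\mathcal{M}_\mathcal{S}) \rangle = -\sum_{S \in V} \sum_{j \in \mathcal{X}_S} \frac{\alpha^S_j}{\alpha} h(\alpha^S_j) + \sum_{S \in E} \sum_{j \in \mathcal{X}_S} \frac{\alpha^S_j}{\alpha} h(\alpha^S_j) + (|V| - |E| - 1)\, h(\alpha) + \sum_{i=1}^N \frac{\alpha_i}{\alpha} h(\alpha_i), \]

where $\alpha^S_j = \sum_{x : x_S = j} \alpha_x$ for $j \in \mathcal{X}_S$. If $p$ is drawn uniformly at random, then

\[ \langle D(p\|\mathcal{M}_\mathcal{S}) \rangle = \sum_{S \in V} \big(h(N) - h(N/N_S)\big) - \sum_{S \in E} \big(h(N) - h(N/N_S)\big) - h(N) + 1, \]

where $N_S = |\mathcal{X}_S|$. If $N/N_S$ is large for all $S \in V \cup E$, then

\[ \langle D(p\|\mathcal{M}_\mathcal{S}) \rangle = 1 - \gamma + O\Big(\max_S \frac{N_S}{N}\Big). \]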
4 Discussion

In the previous section we have shown that the values of $\langle D(p\|\mathcal{M}) \rangle$ are very similar for different models $\mathcal{M}$ in the limit of large $N$, provided the Dirichlet parameters remain bounded and the model remains "small." In particular, if $\alpha_i = 1$ for all $i$, then $\langle D(p\|\mathcal{M}) \rangle \approx 1 - \gamma$ holds for large $N$ and $\mathcal{M} = \{u\}$, for the independence model, for decomposable models, for partition models and for mixtures of product distributions on disjoint supports (for reasonable values of the model parameters). Some of these models are contained in each other, but nevertheless, the expected divergences do not differ too much. The general phenomenon seems to be the following:

• For a low-dimensional model $\mathcal{M} \subset \Delta_{N-1}$ and large $N$, the expected divergence is $\langle D(p\|\mathcal{M}) \rangle \approx 1 - \gamma$, when $p$ is uniformly distributed on $\Delta_{N-1}$.

Of course, this is not a mathematical statement, because it is very easy to construct counter-examples: Using space-filling curves, it is possible to construct one-dimensional models $\mathcal{M}$ with an arbitrarily low value of $\langle D(p\|\mathcal{M}) \rangle$ (for arbitrary $N$). However, we expect that the statement is true for most models that appear in practice. In particular, we conjecture that the statement is true for restricted Boltzmann machines.

In Theorem 4, if $\boldsymbol{\alpha} = (a, \dots, a)$, then the expected divergence from $\mathcal{M}_{\varrho,\nu}$ is minimal if and only if $\nu = u$. In this case $\mathcal{M}_{\varrho,\nu}$ is a partition model. We conjecture that partition models are optimal among all (closures of) exponential families in the following sense:

• For any exponential family $\mathcal{E}$ there is a partition model $\mathcal{M}$ of the same dimension such that $\langle D(p\|\mathcal{E}) \rangle \ge \langle D(p\|\mathcal{M}) \rangle$.

The statement is, of course, true for zero-dimensional exponential families, i.e., models that consist of a single distribution. The conjecture is related to the following conjecture from [9]:

• For any exponential family $\mathcal{E}$ there is a partition model $\mathcal{M}$ of the same dimension such that $\max_{p \in \Delta_{N-1}} D(p\|\mathcal{E}) \ge \max_{p \in \Delta_{N-1}} D(p\|\mathcal{M})$.

Our findings may be biased by the fact that all the models treated in Section 3 are examples of exponential families. As a slight generalization we did computer experiments with a family of models which are not exponential families, but unions of exponential families.

Let $\mathcal{T}$ be a family of partitions, and let $\mathcal{M}_\mathcal{T} = \bigcup_{\varrho \in \mathcal{T}} \mathcal{M}_\varrho$ be the union of the corresponding partition models. Our interest in these models comes from the fact that such models are contained in more difficult models with hidden variables, like restricted Boltzmann machines and deep belief networks. Figure 1 compares a single partition model on three states with the union of all partition models for bipartitions.
[Figure 1 panel titles, left to right: $D(p\|\mathcal{M}_\varrho)$; $D(p\|\mathcal{M}_\varrho) \prod_i p_i^{a-1}$; $D(p\|\bigcup_\varrho \mathcal{M}_\varrho)$; $D(p\|\bigcup_\varrho \mathcal{M}_\varrho) \prod_i p_i^{a-1}$.]

Figure 1: From left to right: Divergence to a partition model with two blocks on $\mathcal{X} = \{1, 2, 3\}$. Same, multiplied by a symmetric Dirichlet density with parameter $a = 5$. Divergence to the union of the three partition models with two blocks on $\mathcal{X} = \{1, 2, 3\}$. Same, multiplied by the symmetric Dirichlet density with $a = 5$. The shading is scaled on each image individually.

For a given $N$ and $1 \le k \le N/2$ let $\mathcal{T}_k$ be the set of all partitions of $\{1, \dots, N\}$ into two blocks of cardinalities $k$ and $N - k$. For different values of $a$ and $N$ we computed $D(p\|\mathcal{M}_{\mathcal{T}_1})$ for 10 000 distributions sampled from $\mathrm{Dir}_{(a,\dots,a)}$, $D(p\|\mathcal{M}_{\mathcal{T}_2})$ for 20 000 distributions sampled from $\mathrm{Dir}_{(a,\dots,a)}$, and $D(p\|\mathcal{M}_{\mathcal{T}_{N/2}})$ for 20 000 distributions sampled from the uniform prior. The results are shown in Figure 2.
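An experiment of this kind can be sketched by brute force for small $N$. The code below is not the original implementation; it is a minimal illustration that enumerates all bipartitions with a given block size, evaluates the divergence to each partition model with formula (2) for $\nu = u$, and takes the minimum over the family. The system size, the seed and the sample count are arbitrary and much smaller than in the experiments described above.

```python
import math
import random
from itertools import combinations

def bipartitions(N, k):
    """All partitions of {0, ..., N-1} into two blocks of sizes k and N - k."""
    parts = []
    for block in combinations(range(N), k):
        if 2 * k == N and 0 not in block:
            continue  # a balanced bipartition would otherwise be listed twice
        rest = tuple(i for i in range(N) if i not in block)
        parts.append((block, rest))
    return parts

def divergence_to_partition_model(p, blocks):
    """Formula (2) with uniform reference measure."""
    total = 0.0
    for block in blocks:
        mass = sum(p[i] for i in block)
        for i in block:
            if p[i] > 0:
                total += p[i] * math.log(p[i] * len(block) / mass)
    return total

def divergence_to_union(p, family):
    """Divergence to a union of models: the minimum over its members."""
    return min(divergence_to_partition_model(p, blocks) for blocks in family)

def mean_divergence(N, k, a, trials, rng):
    """Monte Carlo estimate of the expected divergence to the union of all
    bipartition models with block sizes k and N - k, for p ~ Dir(a, ..., a)."""
    family = bipartitions(N, k)
    acc = 0.0
    for _ in range(trials):
        g = [rng.gammavariate(a, 1.0) for _ in range(N)]
        s = sum(g)
        acc += divergence_to_union([x / s for x in g], family)
    return acc / trials

rng = random.Random(0)
results = {k: mean_divergence(6, k, 1.0, 2000, rng) for k in (1, 2, 3)}
```

The enumeration grows quickly with $N$ for balanced bipartitions, so this direct approach is only practical for small systems.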

In the first two cases the expected divergence seems to tend to the asymptotic value of $\langle D(p\|u) \rangle$. Observe that $\langle D(p\|\mathcal{M}_{\mathcal{T}_1}) \rangle > \langle D(p\|\mathcal{M}_{\mathcal{T}_2}) \rangle$, unless $N = 4$. Intuitively this makes sense for two reasons: First, for $\varrho_1 \in \mathcal{T}_1$ and $\varrho_2 \in \mathcal{T}_2$, using Theorem 4 one can show that $\langle D(p\|\mathcal{M}_{\varrho_1}) \rangle > \langle D(p\|\mathcal{M}_{\varrho_2}) \rangle$; and second, the cardinality of $\mathcal{T}_2$ is much larger than the cardinality of $\mathcal{T}_1$ if $N > 4$. For small values of $N$ this intuition may not always be correct. For example, for $N = 8$, the expected divergence from $\mathcal{M}_{\mathcal{T}_{N/2}}$ is larger than the one from $\mathcal{M}_{\mathcal{T}_2}$, although in this case $|\mathcal{T}_{N/2}| = 35$ and $|\mathcal{T}_2| = 28$, see Figure 2 right.

For $N = 22$ we computed $D(p\|\mathcal{M}_{\mathcal{T}_{N/2}})$ for 500 uniformly sampled distributions (in this case $|\mathcal{T}_{N/2}| = 352\,716$), and found $\langle D(p\|\mathcal{M}_{\mathcal{T}_{N/2}}) \rangle \approx 0.1442$ (with variance 0.0032), which is well below the corresponding expectation values for $\mathcal{M}_{\mathcal{T}_1}$ and $\mathcal{M}_{\mathcal{T}_2}$. We expect that, for large $N$, it is possible to make $\langle D(p\|\mathcal{M}_{\mathcal{T}_k}) \rangle$ much smaller than $\langle D(p\|u) \rangle$ by choosing $k \approx N/2$. In this case, the model $\mathcal{M}_{\mathcal{T}_k}$ has (Hausdorff) dimension only one, but it is a union of exponentially many one-dimensional exponential families.

A Computations and proofs 

The analytic formulas in Theorem 1 are [10, Theorem 7]. The asymptotic expansions are direct.

The proof of Theorem 2 makes use of the following Lemma, see [10, Theorem 3]:
[Figure 2: three panels (left, middle, right), each with the system size $N$ on the horizontal axis.]

Figure 2: Expected divergence (numerically) from various unions of bipartition models with respect to $\mathrm{Dir}_{(a,\dots,a)}$, for different system sizes $N$ and values of the concentration parameter $a$. Left: Union of all bipartition models with blocks of cardinalities 1 and $(N - 1)$. The y-ticks are located at $h(a) - \log(a) - \gamma$, which are the limits of the expected divergence from single bipartition models, see Theorem 4. Middle: Union of all bipartition models with blocks of cardinalities 2 and $(N - 2)$. The peak at $N = 4$ is caused by the fact that there are only 3 different partitions when $N = 4$, instead of $\binom{N}{2}$. The dashed plot indicates corresponding results from the left figure. Right: Comparison of the expected divergence from the two previous models and the union of all $\frac{1}{2}\binom{N}{N/2}$ bipartition models with two blocks of cardinalities $N/2$, for $a = 1$ and even $N$.



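Lemma 7. Let $\{A_1, \dots, A_K\}$ be a partition of $\mathcal{X} = \{1, \dots, N\}$, let $\alpha_1, \dots, \alpha_N$ be positive reals, and let $\alpha^\varrho_k = \sum_{i \in A_k} \alpha_i$ for $k = 1, \dots, K$. Then

\[ \int_{\Delta_{N-1}} \Big(\sum_{i \in A_k} p_i\Big) \log\Big(\sum_{i \in A_k} p_i\Big) \prod_{i=1}^N p_i^{\alpha_i - 1}\, dp = \frac{\prod_{i=1}^N \Gamma(\alpha_i)}{\prod_{k'=1}^K \Gamma(\alpha^\varrho_{k'})} \int_{\Delta_{K-1}} p^\varrho_k \log(p^\varrho_k) \prod_{k'=1}^K (p^\varrho_{k'})^{\alpha^\varrho_{k'} - 1}\, dp^\varrho = \frac{\prod_{i=1}^N \Gamma(\alpha_i)}{\Gamma(\alpha + 1)}\, \alpha^\varrho_k \big(h(\alpha^\varrho_k) - h(\alpha)\big). \]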

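Proof of Theorem 2. Write the Dirichlet density with exponents $n_j = \alpha_j - 1$ and let $n = \sum_j n_j$. The first statement follows from

\[ \frac{\int_{\Delta_{N-1}} p_i \prod_j p_j^{n_j}\, dp}{\int_{\Delta_{N-1}} \prod_j p_j^{n_j}\, dp} = \frac{n_i + 1}{N + n} \]

and $D(p\|q) = -H(p) - \sum_i p_i \log(q_i)$. By Lemma 7,

\[ \int_{\Delta_{N-1}} \log(q_i) \prod_j q_j^{n_j}\, dq \Big/ \int_{\Delta_{N-1}} \prod_j q_j^{n_j}\, dq = h(n_i) - h(N + n - 1), \]

and the remaining statements follow. □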

Theorem 3 is a corollary to Theorem 1, the aggregation property of the Dirichlet priors and the formula (1) for the multi-information. Theorem 4 follows from (2), and Theorem 6 follows from (3). Similarly, Theorem 5 follows from the equality

\[ D(p\|\mathcal{M}^1_\varrho) = \sum_{i=1}^K \sum_{x \in A_i} p(x) \log \frac{p(x)\, p(A_i)^{n-1}}{\prod_{j=1}^n \big(\sum_{y \in A_i : y_j = x_j} p(y)\big)}, \]

which can be derived as follows: The unique solution $q \in \operatorname{arg\,inf}_{q' \in \mathcal{M}^1_\varrho} D(p\|q')$ satisfies $p(A_i) = q(A_i)$, and $q(\cdot|A_i) \in \operatorname{arg\,inf}_{q' \in \mathcal{M}_1}$